(54) Title: ELIMINATION OF TRAPS AND ATOMICITY IN THREAD SYNCHRONIZATION

(57) Abstract: Elimination of traps and atomics in thread synchronization is provided. In one embodiment, a processor includes
a lock cache. The lock cache holds a value that corresponds to or identifies a computer resource only if a current thread executing
on the processor owns the computer resource. A lock cache operation (e.g., a lockcachecheck instruction) determines whether a
value identifying a computer resource is cached in the lock cache and returns a first predetermined value if the value identifying the
computer resource is cached in the lock cache. Otherwise, a second predetermined value is returned.

ELIMINATION OF TRAPS AND ATOMICS IN THREAD SYNCHRONIZATION
TECHNICAL FIELD 

The present invention relates to microprocessors, and more particularly, to locking of computer 
resources. 

BACKGROUND ART 

When different computer entities such as computer processes or threads share a computer resource (for 
example, data, code, or a piece of hardware), it may be desirable to allow one of the computer entities to lock a 
resource for a while to prevent some types of access to the resource by other computer entities. For example, if 
two or more threads share computer data, and one thread has started but not finished modifying the data when
another thread is accessing the data, the other thread may get incorrect information from the data and/or the data 
could be corrupted by the two threads. Also, if one thread has started but not finished execution of a critical code 
section when another thread starts executing the same code section, execution errors may occur if, for example, 
the critical code section modifies the state of a data area, a hardware controller, or some other computer resource. 
Therefore, locking techniques have been provided to allow computer entities to lock computer resources. 

It is desirable to provide fast techniques for locking of computer resources. 

DISCLOSURE OF INVENTION 

The present invention provides elimination of traps and atomics in thread synchronization. Efficient and 
fast thread synchronization is provided. For example, thread synchronization can be efficient and fast in some 
frequently occurring situations, such as when a thread owns a computer resource (e.g., has a lock on a computer 
resource). 

In one embodiment, a lock cache (e.g., four registers) is provided that holds up to four values, each of which is a
reference to a locked object (e.g., an address of the locked object). The lock cache maintains
LOCKCOUNTs. A lockcachecheck instruction (e.g., lockcachecheck object_address, Rd) returns a first
predetermined value if the object_address is stored in the lock cache (e.g., the current thread executing on the
processor owns the object). Otherwise, the lockcachecheck instruction returns a second predetermined value
(e.g., the lock cache is not currently storing the object_address, and the current thread may or may not own the
object, which can be determined based on an access to memory that is handled in software). For example, if the
first predetermined value is returned, then processing can continue with the current instruction stream (the
current thread owns the object). If the second predetermined value is returned, then processing can branch to a
different set of instructions (e.g., to check a table in memory to determine whether the current thread owns the
object, and if not, then the current thread can wait for the object to become available).

Other features and advantages of the present invention are described below. 

BRIEF DESCRIPTION OF DRAWINGS 

FIG. 1 is a block diagram of a computer system including a processor according to the present
invention.

FIG. 2 is a block diagram showing registers that are used for locking operations in the processor of FIG. 
1, and also showing related data structures in the memory of the system of FIG. 1 . 

FIG. 3 is a block diagram showing data structures in the memory of FIG. 1 . 

FIG. 4 is a block diagram showing registers used for locking operations in a processor according to the
present invention.

FIG. 5 is a block diagram illustrating lock caches in accordance with one embodiment of the present 
invention. 

FIG. 6 is a functional diagram illustrating the operation of a lockcachecheck instruction in accordance
with one embodiment of the present invention.

FIG. 7 is a flow diagram illustrating the operation of a lockcachecheck instruction in accordance with 
one embodiment of the present invention. 

MODES FOR CARRYING OUT THE INVENTION 

FIG. 1 is a block diagram of a computer system including locking circuitry. Processor 110 is connected
to memory 120 by bus 130. Processor 110 includes execution unit 136 which executes instructions read from
memory 120. Execution unit 136 includes lock registers 144 labeled LOCKADDR, LOCKCOUNT. These
registers are used for object locking as described below.

Bus 130 is connected to I/O bus and memory interface unit 150 of processor 110. When processor 110
reads instructions from memory 120, interface unit 150 writes the instructions to instruction cache 156.
Then the instructions are decoded by decode unit 160. Decode unit 160 sends control signals to execution
control and microcode unit 166. Unit 166 exchanges control signals with execution unit 136. Decode unit 160
also sends control signals to stack cache and stack cache control unit 170 (called "stack cache" below) or the
register file in processors that do not include a stack cache. Stack cache 170 exchanges control and data signals
with execution unit 136 and data cache unit 180. Cache units 170 and 180 exchange data with memory 120
through interface 150 and bus 130. Execution unit 136 can flush instruction cache 156, stack cache 170 and data
cache 180.

FIG. 2 illustrates registers 144 and one of the corresponding objects in memory 120. Registers 144
include four register pairs labeled LOCKADDR0/LOCKCOUNT0 through LOCKADDR3/LOCKCOUNT3.
Each LOCKADDR register is to hold an address of a locked object. In a preferred embodiment, each address is
32 bits wide, and accordingly each LOCKADDR register is 32 bits wide. However, in some embodiments, each
object starts on a 4-byte boundary. Therefore, in some embodiments the two least significant bits of the object's
address are zero, and are omitted from registers LOCKADDR. In such an embodiment, each register
LOCKADDR is 30 bits wide.
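
The 30-bit encoding described above amounts to a two-bit shift. The following Python sketch illustrates only that bit arithmetic; the function names pack_lockaddr and unpack_lockaddr are hypothetical and are not part of the described processor.

```python
# Illustrative sketch only: models a 30-bit LOCKADDR register in software.
# The function names are hypothetical, not part of the described processor.

def pack_lockaddr(address: int) -> int:
    """Drop the two always-zero LSBs of a 4-byte-aligned 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    if address & 0x3:
        raise ValueError("object address must be 4-byte aligned")
    return address >> 2


def unpack_lockaddr(stored: int) -> int:
    """Restore the full 32-bit byte address from the 30-bit register value."""
    return (stored << 2) & 0xFFFFFFFF
```

Because every object address is assumed to be a multiple of four in such an embodiment, the round trip loses no information.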

If a LOCKADDR register contains 0, this means the register pair is unused. 

In each register pair, the LOCKCOUNT register holds the count of lock instructions for the object
whose address is held in the corresponding LOCKADDR register. The LOCKCOUNT register holds the number
of those lock instructions for which a corresponding unlock instruction has not issued. The LOCKCOUNT
register is incremented on each lock instruction for the object, and is decremented on each unlock instruction.
The lock is actually freed only when the LOCKCOUNT register is decremented to zero. (However, in some
embodiments, the LOCKCOUNT register holds only a portion of the lock count, as described below. The lock is
freed when the entire lock count is decremented to zero.) In some embodiments, each LOCKCOUNT register is
8 bits wide, to hold a number between 0 and 255.

Multiple lock instructions without intervening unlock instructions may be a result of recursive code.
Because the LOCKCOUNT registers keep the net count of the lock and unlock instructions for the objects
(that is, the difference between the numbers of the lock instructions and the unlock instructions for the object),
software programs are relieved from the need to do a test before each unlock instruction to determine whether the
object was locked by some other part of the thread and should therefore remain locked until the need for that lock
has expired.

In some embodiments, registers 144 keep lock addresses and counts for one thread or one computer
process only. When processor 110 switches to a different thread or process, registers 144 are loaded with lock
data (lock addresses and counts) for the new thread or process which is to be executed. Accordingly, lock
registers 144 allow processor 110 to avoid atomic operations when an address hit occurs. Atomic operations
such as a swap operation, a compare and swap operation and a test and set operation can significantly reduce the
performance of the processor.

FIG. 2 illustrates an object whose address is stored in a LOCKADDR register (register LOCKADDR3 in
FIG. 2). In FIG. 2, the object is shown stored in memory 120. However, all or part of the object can be stored in
data cache 180. Throughout this description, when we describe storing data or instructions in memory 120, it is
to be understood that the data or instructions can be stored in data cache 180, stack cache 170 or instruction
cache 156, unless mentioned otherwise.

As shown in FIG. 2, the address in register LOCKADDR3 is a pointer to object structure 220. Object
structure 220 starts with a header 220H. Header 220H is followed by other data (not shown). Header 220H
includes a pointer to class structure 230 describing the object. Class structure 230 is aligned on a 4-byte
boundary. As a result, and because all addresses are byte addresses with each successive byte having an address
one greater than the preceding byte, the two LSBs of the class structure address are zero. These zero LSBs are
not stored in header 220H. Therefore, the header has two bits not used for the address storage. These bits
(header LSBs 0 and 1) are used for object locking. Bit 0, also called the L bit or the LOCK bit, is set to 1 when
the object is locked. Bit 1, also called the W or WANT bit, is set to 1 when a thread is blocked waiting to acquire
the lock for object 220.
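
The header layout just described can be modeled as a 32-bit word whose two low bits carry the LOCK and WANT flags. The Python sketch below is an illustration under that assumption; the constant and function names are hypothetical and do not come from the description.

```python
# Illustrative sketch only: a software model of header 220H.
# All names are hypothetical; the 32-bit width follows the preferred embodiment.

LOCK_BIT = 0x1  # bit 0 (L): the object is locked
WANT_BIT = 0x2  # bit 1 (W): a thread is blocked waiting for the lock
FLAG_MASK = LOCK_BIT | WANT_BIT


def make_header(class_address: int, locked: bool = False, wanted: bool = False) -> int:
    """Combine a 4-byte-aligned class structure address with the two flag bits."""
    if class_address & FLAG_MASK:
        raise ValueError("class structure must be 4-byte aligned")
    return class_address | (LOCK_BIT if locked else 0) | (WANT_BIT if wanted else 0)


def class_pointer(header: int) -> int:
    """Recover the class structure address by masking off the flag bits."""
    return header & ~FLAG_MASK & 0xFFFFFFFF


def is_locked(header: int) -> bool:
    return bool(header & LOCK_BIT)


def is_wanted(header: int) -> bool:
    return bool(header & WANT_BIT)
```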

Tables A and B set forth exemplary pseudocode for circuitry that executes lock and unlock instructions
for one embodiment of processor 110. That circuitry is part of execution unit 136 and/or execution control and
microcode unit 166. The pseudocode language of Tables A and B is similar to the hardware description language
Verilog® described, for example, in D.E. Thomas, J. P. Moorby, "The Verilog® Hardware Description
Language" (1991). The pseudocode can be easily converted to Verilog, and the corresponding circuitry can be
implemented using methods known in the art.

Table A shows pseudocode for a lock instruction. At each of steps 1-0 through 1-3 in Table A, the
contents of the corresponding register LOCKADDR0 through LOCKADDR3 are compared with the address of
the object to be locked. If there is a match, the corresponding register LOCKCOUNT is incremented (steps 1-0a,
1-1a, 1-2a, 1-3a) and compared with zero (steps 1-0b, 1-1b, 1-2b, 1-3b). If the LOCKCOUNT register becomes
0 after incrementation, an overflow has occurred, and a trap LockCountOverflowIncrementTrap is generated.
Generation of a trap terminates execution of the instruction. If the trap is enabled, processor 110 starts executing
a trap handler defined for the trap. As is well known in the art, a trap handler is a predefined set of computer
code which executes when the trap occurs.

In some embodiments, the trap handler for LockCountOverflowIncrementTrap maintains a wider lock
counter mLOCKCOUNT (FIG. 3) than the LOCKCOUNT register. More particularly, in some embodiments,
the operating system keeps track of locked objects using tables 310 in memory 120. A separate table 310 is kept
for each thread. A table 310 is created for a thread when the thread is created, and the table 310 is deallocated
when the corresponding thread is destroyed. Each table 310 includes a number of entries (mLOCKADDR,
mLOCKCOUNT). The function of each entry is similar to the function of a register pair LOCKADDR/
LOCKCOUNT. More particularly, mLOCKADDR holds the address of an object locked by the thread.
mLOCKCOUNT holds the count of lock instructions issued by the thread for the object. The count of lock
instructions is the number of the lock instructions for which a corresponding unlock instruction has not been
executed. If some mLOCKADDR = 0, then the entry is unused.

A table 310 may have more than four entries. Different tables 310 may have different numbers of
entries.

Each memory location mLOCKADDR is 32 or 30 bits wide in some embodiments. Each location 
mLOCKCOUNT is 8 or more bits wide. In some embodiments, each location mLOCKCOUNT is 32 bits wide, 
and each register LOCKCOUNT is 8 bits wide. 

When the operating system schedules a thread for execution, the operating system may load up to four 
entries from the corresponding table 310 into register pairs LOCKADDR/LOCKCOUNT. Each entry is written
into a single register pair LOCKADDR/LOCKCOUNT. If mLOCKCOUNT is wider than LOCKCOUNT, the 
operating system writes to LOCKCOUNT as many LSBs of mLOCKCOUNT as will fit into LOCKCOUNT (8 
LSBs in some embodiments). If some register pair does not receive an entry from table 310, the operating 
system sets the corresponding register LOCKADDR to 0 to indicate that the register pair is unused ("empty"). 

In some embodiments, table 310 includes a bit (not shown) for each entry to indicate whether the entry 
is to be written into a LOCKADDR/LOCKCOUNT register pair when the thread is scheduled for execution. In 
other embodiments, for each thread the operating system keeps a list (not shown) of entries to be written to
registers 144 when the thread is scheduled for execution. In some embodiments, the operating system has a bit 
for each entry, or a list of entries, to mark entries that have been written to LOCKADDR/LOCKCOUNT 
registers. 

In some cases, lock and unlock instructions do not cause a trap to be generated. Therefore, the 
mLOCKCOUNT LSBs may be invalid, or there may be no entry in a table 310 for a lock specified by a
LOCKADDR/LOCKCOUNT register pair. 

When some thread T1 is preempted and another thread T2 is scheduled for execution on processor 110,
the operating system writes all the non-empty LOCKADDR/LOCKCOUNT register pairs to the table 310 of
thread T1 before loading the registers from the table 310 of thread T2. If mLOCKCOUNT is wider than
LOCKCOUNT, the operating system writes each LOCKCOUNT register to the LSBs of the corresponding
location mLOCKCOUNT. If the current thread's table 310 does not have an entry for a lock specified by a
LOCKADDR/LOCKCOUNT register pair, an entry is created by the operating system.
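
The save and reload performed on a thread switch can be sketched as follows. This is a simplified Python model, not the operating system code of any embodiment: a table 310 is represented by a dict mapping mLOCKADDR to mLOCKCOUNT, the 8-bit LOCKCOUNT width is taken from the embodiments above, and the choice of which entries to load (here, the first four in the dict) is an assumption, since the description leaves that policy open. All function names are hypothetical.

```python
# Illustrative sketch only: save/restore of the lock registers on a thread switch.
# A table 310 is modeled as a dict {mLOCKADDR: mLOCKCOUNT}; names are hypothetical.

COUNT_BITS = 8                      # width of each LOCKCOUNT register
COUNT_MASK = (1 << COUNT_BITS) - 1
NUM_PAIRS = 4                       # number of LOCKADDR/LOCKCOUNT pairs


def save_registers(lockaddr, lockcount, table):
    """Write every non-empty register pair back to the outgoing thread's table."""
    for addr, count in zip(lockaddr, lockcount):
        if addr != 0:                        # 0 marks an unused pair
            wide = table.get(addr, 0)        # an entry is created if none exists
            # Only the LSBs of mLOCKCOUNT are overwritten by the register value.
            table[addr] = (wide & ~COUNT_MASK) | (count & COUNT_MASK)


def load_registers(table):
    """Return (lockaddr, lockcount) lists for the incoming thread."""
    chosen = list(table.items())[:NUM_PAIRS]   # selection policy is an assumption
    lockaddr = [addr for addr, _ in chosen]
    lockcount = [wide & COUNT_MASK for _, wide in chosen]
    unused = NUM_PAIRS - len(chosen)
    # Pairs that receive no entry get LOCKADDR = 0 to mark them empty.
    return lockaddr + [0] * unused, lockcount + [0] * unused
```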

In some embodiments, the trap handler for LockCountOverflowIncrementTrap searches the table 310 of the
current thread for the entry with mLOCKADDR containing the address of the object to be locked. If such an
entry does not exist, the trap handler finds a free entry, and sets its mLOCKADDR to the address of the object to
be locked and mLOCKCOUNT to zero. In either case (whether the entry existed or has just been created), the
trap handler increments the mLOCKCOUNT MSBs which are not stored in the LOCKCOUNT register, and sets
the LSBs to zero.
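
A minimal Python sketch of this overflow handling follows, assuming an 8-bit LOCKCOUNT register and a table 310 modeled as a dict from mLOCKADDR to mLOCKCOUNT. The function name is hypothetical.

```python
# Illustrative sketch only: the overflow trap handler's update of mLOCKCOUNT.
# The table is modeled as a dict {mLOCKADDR: mLOCKCOUNT}; names are hypothetical.

COUNT_BITS = 8  # width of the LOCKCOUNT register assumed in this sketch


def on_lock_count_overflow(table: dict, obj_addr: int) -> int:
    """Update the table entry for obj_addr after LOCKCOUNT wrapped to zero."""
    wide = table.get(obj_addr, 0)          # a newly created entry starts at zero
    msbs = (wide >> COUNT_BITS) + 1        # increment the bits the register cannot hold
    table[obj_addr] = msbs << COUNT_BITS   # the LSBs are set to zero
    return table[obj_addr]
```

With these assumptions, the first overflow for an object with no table entry leaves mLOCKCOUNT at 0x100, and a second overflow leaves it at 0x200.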

We now return to describing execution of the lock instruction by execution unit 136. In some
embodiments, the comparisons of the registers LOCKADDR with the address of the object to be locked at steps
1-0 through 1-3 of Table A are performed in parallel by four comparators corresponding to the four registers, and
the incrementation of LOCKCOUNT at steps 1-0a, 1-1a, 1-2a, 1-3a is performed using incrementors. Such
comparators and incrementors are known in the art.

Execution unit 136 reads the LOCK bit (FIG. 2) from the header 220H of the object to be locked, and
sets the LOCK bit to 1 to indicate that the object is locked (step 2a). This read-and-set (test-and-set) operation is
an atomic operation, that is, (1) the processor will not take an interrupt until the operation is completed, and (2)
in a multiprocessor environment, no other processor will be able to access the LOCK bit until the operation is
completed. In some embodiments, this test-and-set operation is done in parallel with steps 1-0 through 1-3. In
other embodiments, this test-and-set operation is done after steps 1-0 through 1-3, and only if none of the
LOCKADDR registers contains the address of the object to be locked.

If none of the LOCKADDR registers contains the address of the object to be locked (step 2), and the
LOCK bit was set before the test-and-set operation (step 2a), processor 110 generates a trap LockBusyTrap.

The trap handler for LockBusyTrap searches the table 310 of the current thread to see if the current
thread holds the lock for the object. If the object address equals an address stored in mLOCKADDR in one of
the entries of the table 310, the corresponding mLOCKCOUNT is incremented by the trap handler. Additionally,
in some embodiments the trap handler may place the entry into a register pair LOCKADDR/LOCKCOUNT.
This is desirable if the next lock or unlock instruction to be issued by the thread is likely to be for the object for
which the thread issued the most recent lock instruction. If the trap handler desires to place the entry into a
register pair but all the register pairs are taken by other locks, the trap handler vacates one of the register pairs by
writing the register pair to the table 310. (The LOCKCOUNT register is written to the mLOCKCOUNT LSBs if
mLOCKCOUNT is wider than LOCKCOUNT, as described above.)

If the current thread does not hold the lock and thus the object address does not match any of the
memory locations mLOCKADDR in the corresponding table 310, the trap handler sets the WANT bit in the
object header (FIG. 2) and places the thread into a queue of threads waiting to acquire this lock.

We return now to describing the execution of the lock instruction by execution unit 136. If the object's
LOCK bit was not set before the test-and-set operation, steps 2b-0 through 2b-3 are executed. At each step 2b-i
(i = 0 through 3), a respective comparator compares the register LOCKADDRi with zero. This comparison is
performed in parallel with comparisons of steps 1-0 through 1-3 and 2. If LOCKADDR0 = 0 (step 2b-0), the
register pair LOCKADDR0/LOCKCOUNT0 is unused. Register LOCKADDR0 is written with the address of
the object being locked (step 2b-0a). The register LOCKCOUNT0 is set to 1 (step 2b-0b).

If LOCKADDR0 is not 0 but LOCKADDR1 = 0, then register LOCKADDR1 is written with the
address of the object to be locked, and register LOCKCOUNT1 is set to 1 (steps 2b-1a, 2b-1b). If
LOCKADDR0 and LOCKADDR1 are not 0 but LOCKADDR2 = 0, then LOCKADDR2 is written with the
address of the object to be locked, and register LOCKCOUNT2 is set to 1 (steps 2b-2a, 2b-2b). If
LOCKADDR0, LOCKADDR1, and LOCKADDR2 are not 0 but LOCKADDR3 = 0, then register
LOCKADDR3 is written with the address of the object to be locked, and register LOCKCOUNT3 is set to 1
(steps 2b-3a, 2b-3b).

If none of the LOCKADDR registers is equal to 0, then the trap NoLockAddrRegsTrap is generated
(step 2c). In some embodiments, the trap handler for this trap finds or creates a free entry in the table 310 of the
current thread. The trap handler writes the address of the object to be locked into location mLOCKADDR of that
entry, and sets the corresponding mLOCKCOUNT to 1. Additionally, the trap handler may place the table entry
into a LOCKADDR/LOCKCOUNT register pair. The old contents of the register pair are stored in the thread's
table 310 before the register pair is written. It will be appreciated that other types of known replacement
algorithms may also be used, e.g., a least recently used type of replacement algorithm or a random type of
replacement algorithm.
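
The lock instruction flow described in the preceding paragraphs can be summarized in a small software model. The Python sketch below is written from the prose description only (Table A itself is not reproduced here), follows the embodiment in which the test-and-set is performed after the address comparisons, and performs sequentially what the hardware does in parallel. The class and function names are hypothetical, traps are modeled as exceptions, the object header is a one-element list so that it can be updated in place, and object addresses are assumed to be non-zero.

```python
# Illustrative sketch only: a sequential software model of the lock instruction.
# Names are hypothetical; hardware comparisons are parallel, and the
# test-and-set is atomic in hardware but not in this single-threaded model.

LOCK_BIT = 0x1


class Trap(Exception):
    """Stands in for a hardware trap; args[0] carries the trap name."""


class LockRegisters:
    def __init__(self, pairs: int = 4, count_bits: int = 8):
        self.lockaddr = [0] * pairs    # 0 marks an unused pair
        self.lockcount = [0] * pairs
        self.count_mask = (1 << count_bits) - 1


def lock(regs: LockRegisters, header: list, obj_addr: int) -> None:
    # Steps 1-0 through 1-3: look for the object address in LOCKADDR.
    for i, addr in enumerate(regs.lockaddr):
        if addr == obj_addr:
            regs.lockcount[i] = (regs.lockcount[i] + 1) & regs.count_mask
            if regs.lockcount[i] == 0:          # wrapped around: overflow
                raise Trap("LockCountOverflowIncrementTrap")
            return
    # Step 2a: test-and-set of the LOCK bit in the object header.
    was_locked = bool(header[0] & LOCK_BIT)
    header[0] |= LOCK_BIT
    if was_locked:
        raise Trap("LockBusyTrap")
    # Steps 2b-0 through 2b-3: claim the first unused register pair.
    for i, addr in enumerate(regs.lockaddr):
        if addr == 0:
            regs.lockaddr[i] = obj_addr
            regs.lockcount[i] = 1
            return
    # Step 2c: no register pair is free.
    raise Trap("NoLockAddrRegsTrap")
```

In this model, a second lock on an object already held by the thread only increments a register and touches neither the header nor a trap handler, which is the fast path the description emphasizes.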

Table B shows exemplary pseudocode for the unlock instruction. At steps 1-0 through 1-3, the
LOCKADDR registers are compared in parallel with the address of the object to be unlocked. If a match occurs,
this indicates that the current thread holds the lock, and the corresponding LOCKCOUNT register is decremented
by a decrementor (steps 1-0a, 1-1a, 1-2a, 1-3a) and compared with zero (steps 1-0b, 1-1b, 1-2b, 1-3b). If the
LOCKCOUNT register becomes 0 after decrementation, the trap LockCountZeroDecrementTrap is generated.
As described above, in some embodiments, the locations mLOCKCOUNT in tables 310 are wider than the
LOCKCOUNT register. In some such embodiments, the trap handler for LockCountZeroDecrementTrap
searches the corresponding table 310 for an entry whose mLOCKADDR stores the address of the object being
unlocked. If such entry is found, the trap handler checks the mLOCKCOUNT location corresponding to the
LOCKCOUNT register which was decremented to 0. If that mLOCKCOUNT location has a "1" in the MSBs
that were not written into the LOCKCOUNT register, the object remains locked by the thread. In the
mLOCKCOUNT memory location the field formed by the MSBs is decremented, and the LSBs are set to 11...1
(all 1's) and are written to the LOCKCOUNT register.

If the mLOCKCOUNT MSBs are all 0's, or if there is no entry with mLOCKADDR holding the address
of the object being unlocked, then the trap handler frees the lock making it available for other threads. Freeing
the lock is described in more detail below.

If the mLOCKCOUNT locations are not wider than the LOCKCOUNT registers, the trap handler need
not check an mLOCKCOUNT location to determine whether the lock is to be freed.
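
The unlock path can be modeled in the same style. The Python sketch below is based on the prose description of the unlock instruction and of the LockCountZeroDecrementTrap handler, assuming an 8-bit LOCKCOUNT register and a table 310 modeled as a dict. All names are hypothetical, and traps are modeled as exceptions.

```python
# Illustrative sketch only: the unlock instruction and the zero-count handler.
# The table is modeled as a dict {mLOCKADDR: mLOCKCOUNT}; names are hypothetical.

COUNT_BITS = 8
COUNT_MASK = (1 << COUNT_BITS) - 1


class Trap(Exception):
    """Stands in for a hardware trap; args[0] carries the trap name."""


def unlock(lockaddr: list, lockcount: list, obj_addr: int) -> None:
    """Model of the unlock instruction; raises Trap where the hardware would trap."""
    for i, addr in enumerate(lockaddr):
        if addr == obj_addr:                     # steps 1-0 through 1-3
            lockcount[i] = (lockcount[i] - 1) & COUNT_MASK
            if lockcount[i] == 0:
                raise Trap("LockCountZeroDecrementTrap")
            return
    raise Trap("LockReleaseTrap")                # step 2: no register matched


def on_lock_count_zero(table: dict, obj_addr: int):
    """Return the value to reload into LOCKCOUNT, or None if the lock is to be freed."""
    wide = table.get(obj_addr)
    if wide is None or (wide >> COUNT_BITS) == 0:
        return None                              # no entry, or MSBs all zero
    msbs = (wide >> COUNT_BITS) - 1              # decrement the MSB field
    table[obj_addr] = (msbs << COUNT_BITS) | COUNT_MASK   # LSBs set to all 1's
    return COUNT_MASK                            # written to the LOCKCOUNT register
```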

Freeing the lock involves the following operations. The trap handler examines the WANT bit of object
header 220H. If the WANT bit is set, another thread is blocking on this lock. The trap handler selects one of
such threads, sets its status to runnable, and gives the lock to this thread. In particular, the trap handler writes the
count of 1 into the LOCKCOUNT register. If there was a corresponding pair mLOCKADDR/mLOCKCOUNT,
the trap handler writes 1 to the mLOCKCOUNT location. Alternatively, in some embodiments, the trap handler
writes 0 to the mLOCKADDR location to deallocate the mLOCKADDR/mLOCKCOUNT pair. Further, if the
thread receiving the lock is the only thread that has been blocking on the lock, the trap handler resets the WANT
bit.

If there were no threads blocking on the lock, the trap handler writes zero to (a) the corresponding
LOCKADDR register and (b) the corresponding mLOCKADDR location if one exists. In addition, the trap
handler resets the LOCK bit in header 220H. Also, if the current thread's table 310 includes a non-empty entry
which could not be written into the LOCKADDR/LOCKCOUNT registers because the registers were
unavailable, the trap handler places one of the entries into the LOCKADDR/LOCKCOUNT register pair which is
being vacated by the lock freeing operation.
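
The header manipulation involved in freeing a lock can be sketched as below. This Python model covers only the WANT and LOCK bits and the choice of a waiting thread; register and table bookkeeping is omitted. The waiter queue, the first-in-first-out selection, and the function name are assumptions made for illustration, since the description does not fix how a waiting thread is selected.

```python
# Illustrative sketch only: header-bit handling when a lock is freed.
# The waiter queue and FIFO selection are assumptions; names are hypothetical.

from collections import deque

LOCK_BIT = 0x1
WANT_BIT = 0x2


def free_lock(header: list, waiters: deque):
    """Hand the lock to a waiting thread, or clear the LOCK bit if none is waiting.

    header is a one-element list holding the header word; waiters holds the
    identifiers of threads blocked on this lock. Returns the new owner or None.
    """
    if header[0] & WANT_BIT and waiters:
        new_owner = waiters.popleft()   # selection policy is an assumption (FIFO)
        if not waiters:
            header[0] &= ~WANT_BIT      # the last waiter has been served
        return new_owner                # LOCK bit stays set for the new owner
    header[0] &= ~LOCK_BIT              # nobody is waiting: the object is unlocked
    return None
```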

If none of the LOCKADDR registers holds the address of the object to be unlocked (step 2), the
LockReleaseTrap is generated. The associated trap handler searches the mLOCKADDR locations of the current
thread's table 310 for the address of the object to be unlocked. If a match occurs, the corresponding location
mLOCKCOUNT is decremented by the trap handler. If mLOCKCOUNT becomes 0, the lock is freed. To free
the lock, the trap handler performs operations similar to those described above for the trap
LockCountZeroDecrementTrap. More particularly, if the WANT bit is set, the trap handler finds another thread
blocking on the lock and sets that thread's status to runnable. The trap handler sets the corresponding location
mLOCKCOUNT to 1. In some embodiments, the trap handler places the mLOCKADDR/mLOCKCOUNT entry
into a LOCKADDR/LOCKCOUNT register pair. If the thread receiving the lock is the only thread that has been
blocking on the lock, the trap handler resets the WANT bit. If there were no threads blocking on the lock (the
WANT bit was 0), the trap handler writes zero to the mLOCKADDR location and resets the LOCK bit in object
header 220H.

If none of the memory locations mLOCKADDR in table 310 of the current thread holds the address of
the object to be unlocked, the trap handler generates the exception IllegalMonitorStateException. In some
embodiments, this exception is a Java™ throw. More particularly, in some embodiments, processor 110 executes
Java™ Virtual Machine language instructions (also known as Java byte codes). The Java™ Virtual Machine
language is described, for example, in T. Lindholm and F. Yellin, "The Java™ Virtual Machine Specification"
(1997) incorporated herein by reference.

Processor 110 provides fast locking and unlocking in many of the following common situations: when
there is no contention for a lock, and when a thread performs multiple lock operations on the same object before
the object lock is freed. More particularly, when a lock instruction is issued, in many cases the object has not
been locked by another thread (that is, no contention occurs). If the object has already been locked by the same
thread that has now issued the lock instruction, in many cases the address of the object is already in a
LOCKADDR register because in many cases the thread does not hold more than four locks at the same time and
all the locked object addresses for the thread are in the LOCKADDR registers. Even if not all the locked object
addresses are in the LOCKADDR registers, there is a possibility that the address of the object specified by the
lock instruction is in a LOCKADDR register. In many such cases, the locking operation requires incrementing
the corresponding LOCKCOUNT register (Table A, steps 1-ia where i = 0, 1, 2, 3), which is a fast operation in
many embodiments. If the incrementation does not lead to an overflow, no trap will be generated.

Locking is also fast when the object has not been locked by any thread (including the thread issuing the
lock instruction) if one of the register pairs LOCKADDR/LOCKCOUNT is unused. In such cases, the object is
locked in one of steps 2b-0 through 2b-3 (Table A). Again, no trap is generated.

Similarly, in an unlock instruction, in many cases the address of the object to be unlocked will be in one 
of the LOCKADDR registers. If the corresponding LOCKCOUNT register is decremented to a non-zero value, 
no trap is generated. 

In some embodiments, processor 110 is a microprocessor of type "picoJava I" whose specification is
produced by Sun Microsystems of Mountain View, California. This microprocessor executes Java Virtual
Machine instructions. The lock instruction is the "monitorenter" instruction of the Java Virtual Machine
instruction set or the "enter_sync_method" instruction of the processor "picoJava I". The "enter_sync_method"
instruction is similar to "monitorenter" but the "enter_sync_method" instruction takes as a parameter a reference to
a method rather than an object. "Enter_sync_method" locks the receiving object for the method and invokes the
method. The unlock instruction is the "monitorexit" instruction of the Java Virtual Machine instruction set or the
return instruction from a method referenced in a preceding "enter_sync_method" instruction.

Some embodiments of processor 110 include more or fewer than four LOCKADDR/LOCKCOUNT
register pairs.

In some embodiments, registers 144 include register triples (THREAD_ID, LOCKADDR,
LOCKCOUNT) as shown in FIG. 4. In each triple, the register THREAD_ID identifies the thread which holds
the lock recorded in the register pair LOCKADDR/LOCKCOUNT. When a lock or unlock instruction is issued,
execution unit 136 examines only those LOCKADDR/LOCKCOUNT pairs for which the register THREAD_ID
holds the ID of the current thread. In other respects, the execution of lock and unlock instructions is similar to
the case of FIG. 2. The structure of FIG. 4 makes it easier to keep the locked objects' addresses and lock counts
in registers 144 for different threads at the same time. In some embodiments used with the structure of FIG. 4,
the operating system does not reload the registers 144 when a different thread becomes scheduled for execution.
The operating system maintains a table 310 for each thread as shown in FIG. 3. When a register triple needs to
be vacated, the corresponding LOCKADDR/LOCKCOUNT values are written to the corresponding table 310.
When a table entry is placed into a register pair LOCKADDR/LOCKCOUNT, the corresponding register
THREAD_ID is written with the ID of the corresponding thread.
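
The effect of the THREAD_ID register on the lookup can be illustrated with a short Python sketch. Triples are modeled as tuples, the comparison that hardware would perform in parallel is done in a loop, and the function name find_triple is hypothetical.

```python
# Illustrative sketch only: lookup in register triples of the FIG. 4 arrangement.
# Each triple is modeled as a (THREAD_ID, LOCKADDR, LOCKCOUNT) tuple.

def find_triple(triples, current_thread_id, obj_addr):
    """Return the index of the current thread's triple holding obj_addr, else None."""
    for i, (thread_id, lockaddr, _lockcount) in enumerate(triples):
        # Only triples tagged with the current thread's ID are examined.
        if thread_id == current_thread_id and lockaddr == obj_addr:
            return i
    return None
```

Because entries of other threads are simply skipped, the registers need not be reloaded when a different thread is scheduled.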

The processors of FIGs. 1-4 are suitable for efficient implementation of the Java Virtual Machine lock
and unlock instructions "monitorenter" and "monitorexit". The counters associated with the object monitors in
Java can be implemented using registers LOCKCOUNT.

In some embodiments, registers LOCKCOUNT and locations mLOCKCOUNT are omitted. The
processor does not keep track of the lock counts, and the processor frees a lock on any unlock instruction
corresponding to the lock. The processor operation is similar to the operation described above in connection with
Tables A and B. However, in Table A, steps 1-0 through 1-3b are omitted. Steps 2b-0b, 2b-1b, 2b-2b, and 2b-3b
(LOCKCOUNT operations) are also omitted. In Table B, step 1-0a is omitted, and at step 1-0b the trap
LockCountZeroDecrementTrap is generated unconditionally. The same applies to steps 1-1a and 1-1b, 1-2a and
1-2b, 1-3a and 1-3b.

In some embodiments, each LOCKCOUNT register is 1 bit wide, and the processor frees a lock on any
unlock instruction corresponding to the lock.

The monitorenter bytecode instruction of the Java™ VM requires LOCKCOUNTs. Thus, for a
microprocessor that implements a Java™ Virtual Machine (i.e., a bytecode engine), it is advantageous to
implement the monitorenter bytecode instruction of the Java™ VM by providing a lock cache that holds
LOCKCOUNTs.

However, in an instruction set architecture of a general microprocessor, locking of computer resources
may not require LOCKCOUNTs. Moreover, elimination of atomics and traps in thread synchronization is
desired. Accordingly, an efficient and fast locking mechanism for a general microprocessor that checks whether
a current thread executing on a processor owns a computer resource is provided. In one embodiment, an
instruction for a microprocessor is provided to determine whether a current thread executing on a processor of
the microprocessor owns a computer resource by checking a lock cache of the processor.

FIG. 5 is a block diagram illustrating lock caches 512 and 514 in accordance with one embodiment of
the present invention. P1 processor 502 of multiprocessor 500 includes a lock cache 512 and an execution unit
516 executing, for example, thread_A. P2 processor 504 includes a lock cache 514 and an execution unit 518
executing, for example, thread_B. Memory 120 stores an object header 220H as similarly discussed above. In
one embodiment, lock caches 512 and 514 each include four registers to provide four-entry caches, which are
disposed on execution units 516 and 518, respectively.

In this embodiment, LOCKCOUNTs are not stored (e.g., cached) in the lock caches. The operation of 
multiprocessor 500 using lock caches 512 and 514 is discussed below with respect to FIGs. 7 and 8. The
operation of maintaining memory 120 and loading the lock caches with the appropriate information from 
memory 120 upon beginning execution of a new thread can be similar to the implementation and use of memory 
120 discussed above. 

In particular, lock caches 512 and 514 cache an object address only if the current thread executing on
the respective processor owns the object (e.g., has locked the object). No lock count is maintained in lock caches 
512 and 514. 

Referring to FIG. 6, a lockcachecheck instruction 602 includes an 8-bit lockcachecheck opcode 604, a
7-bit object_address 606 (e.g., specified by a register specifier rs1), and a 7-bit register destination (Rd) 608 for
the value returned by execution of the lockcachecheck instruction. For example, the lockcachecheck instruction
can be executed on P1 processor 502 of FIG. 5 in one cycle, which is performed in the E stage of the pipeline.
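
The field widths given above (an 8-bit opcode, a 7-bit rs1 specifier, and a 7-bit Rd specifier) can be illustrated by a short packing sketch. The description gives only the widths, so the 32-bit instruction word, the bit positions, and the names `lcc_fields`, `lcc_encode`, and `lcc_decode` below are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the description states only the field widths.
 * Assumed here: opcode in bits 31..24, rs1 in bits 13..7, Rd in bits 6..0. */
#define LCC_OPCODE_SHIFT 24u
#define LCC_RS1_SHIFT     7u
#define LCC_RD_SHIFT      0u
#define LCC_OPCODE_MASK 0xFFu  /* 8-bit opcode */
#define LCC_REG_MASK    0x7Fu  /* 7-bit register specifier */

typedef struct {
    uint8_t opcode;  /* lockcachecheck opcode (value not given in the text) */
    uint8_t rs1;     /* register holding object_address */
    uint8_t rd;      /* destination register for the hit/miss result */
} lcc_fields;

/* Pack the three fields into one instruction word. */
static uint32_t lcc_encode(lcc_fields f) {
    assert(f.rs1 <= LCC_REG_MASK && f.rd <= LCC_REG_MASK);
    return ((uint32_t)f.opcode << LCC_OPCODE_SHIFT)
         | ((uint32_t)f.rs1 << LCC_RS1_SHIFT)
         | ((uint32_t)f.rd << LCC_RD_SHIFT);
}

/* Recover the three fields from an instruction word. */
static lcc_fields lcc_decode(uint32_t word) {
    lcc_fields f;
    f.opcode = (uint8_t)((word >> LCC_OPCODE_SHIFT) & LCC_OPCODE_MASK);
    f.rs1 = (uint8_t)((word >> LCC_RS1_SHIFT) & LCC_REG_MASK);
    f.rd = (uint8_t)((word >> LCC_RD_SHIFT) & LCC_REG_MASK);
    return f;
}
```

A 7-bit specifier can name up to 128 registers; the unused bits of the assumed 32-bit word are left at zero in this sketch.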

The execution of the lockcachecheck instruction performs the following operation: the object_address is
compared with each entry of lock cache 512 to determine if the object_address, which corresponds to a computer
resource, is held in the lock cache (i.e., a cache hit/miss is determined). In a preferred embodiment, the lock cache is
implemented with a content addressable memory (CAM) or other associative memory because such a memory
determines cache hits in parallel. The result of the hit/miss operation is returned to Rd 608. For example, Rd is
set to 0 if the lockcachecheck operation results in a miss, and Rd is set to 1 if the lockcachecheck operation
results in a hit. One of ordinary skill in the art will recognize that there are various ways to implement the
circuitry for performing the operation of the lockcachecheck instruction in a microprocessor, such as the
microprocessor of FIG. 5.
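
The hit/miss operation described above can be modeled in software as follows. This is a behavioral sketch only: a CAM compares all entries in parallel, whereas the loop below visits them one at a time. The type name `lock_cache` and the use of zero to mark an unused entry (borrowed from the LOCKADDR convention of the earlier embodiments) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_CACHE_ENTRIES 4  /* four-entry cache, as in the FIG. 5 embodiment */

/* Holds the addresses of objects owned by the current thread.
 * Assumption: an entry equal to 0 is unused. */
typedef struct {
    uintptr_t addr[LOCK_CACHE_ENTRIES];
} lock_cache;

/* Software model of the lockcachecheck operation.
 * Returns the value that would be written to Rd: 1 on a hit, 0 on a miss. */
static int lockcachecheck(const lock_cache *lc, uintptr_t object_address) {
    assert(lc != NULL);
    int hit = 0;
    for (size_t i = 0; i < LOCK_CACHE_ENTRIES; i++) {
        /* Every entry is examined and the results are ORed together, mirroring
         * the parallel compare. Address 0 never hits, because 0 marks an
         * unused entry in this sketch. */
        hit |= (object_address != 0 && lc->addr[i] == object_address);
    }
    return hit;
}
```

Because no lock count is kept in this embodiment, the model has no state to update on a hit; it only reports ownership.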

Referring to FIG. 7, a lockcachecheck instruction is executed on P1 processor 502 of FIG. 5 at stage
702. At stage 704, the lockcachecheck instruction returns 0 (i.e., Rd = 0) if the object (i.e., address_of_object) is
not cached in lock cache 512 (i.e., a cache miss). The lockcachecheck instruction returns 1 (i.e., Rd = 1) if the
object (i.e., address_of_object) is cached in lock cache 512 (i.e., a cache hit). The operation of stage 704 can be
implemented in hardware in a microprocessor, such as P1 processor 502 of FIG. 5. At stage 706, if Rd = 0, then
a branch operation, for example, is taken to attempt to acquire the lock. Thus, software is responsible for
handling a lock cache miss (e.g., the software can determine if the current thread owns the object by checking a
table maintained in memory, and if not, then the software can branch to a loop that waits for the object to become
available). Otherwise (i.e., Rd = 1), P1 processor 502 continues executing the current instruction stream, because
the current thread executing on P1 processor 502 owns the object. If the lock count is not incremented in
hardware, then software may increment the lock count.
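
The control flow of stages 702 through 706 can be sketched as below. The sketch only classifies which path is taken; it does not implement the wait loop or the lock acquisition. The names `owner_table`, `table_owns`, `enter_critical`, and `lock_path` are hypothetical, and the in-memory ownership table is modeled as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_CACHE_ENTRIES 4

typedef struct { uintptr_t addr[LOCK_CACHE_ENTRIES]; } lock_cache;

/* Hypothetical stand-in for the table maintained in memory that lists
 * the objects owned by the current thread. */
typedef struct { const uintptr_t *owned; size_t count; } owner_table;

typedef enum {
    PATH_HIT_CONTINUE,      /* Rd = 1: continue the instruction stream */
    PATH_MISS_OWNED,        /* Rd = 0, but the table shows ownership */
    PATH_MISS_MUST_ACQUIRE  /* Rd = 0 and not owned: wait for the object */
} lock_path;

/* Software model of the instruction: 1 on a hit, 0 on a miss. */
static int lockcachecheck(const lock_cache *lc, uintptr_t obj) {
    for (size_t i = 0; i < LOCK_CACHE_ENTRIES; i++)
        if (obj != 0 && lc->addr[i] == obj) return 1;
    return 0;
}

/* Linear search of the ownership table. */
static int table_owns(const owner_table *t, uintptr_t obj) {
    for (size_t i = 0; i < t->count; i++)
        if (t->owned[i] == obj) return 1;
    return 0;
}

/* Stage 702/704: execute the check. Stage 706: branch on Rd. */
static lock_path enter_critical(const lock_cache *lc, const owner_table *t,
                                uintptr_t obj) {
    assert(lc != NULL && t != NULL);
    if (lockcachecheck(lc, obj) == 1)
        return PATH_HIT_CONTINUE;      /* no trap, no atomic operation */
    if (table_owns(t, obj))
        return PATH_MISS_OWNED;        /* miss handled entirely in software */
    return PATH_MISS_MUST_ACQUIRE;     /* caller would loop until available */
}
```

On the hit path the thread proceeds without a trap or an atomic memory operation; only the miss paths fall back to software.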

Accordingly, in the embodiments described above with respect to FIGs. 5 through 7, elimination of 
atomics and traps in thread synchronization is provided. For example, a lockcachecheck instruction can be 
provided for a general microprocessor in which it is desirable to provide for an efficient implementation of 
locking of computer resources. 

The above embodiments illustrate but do not limit the present invention. The present invention is not
limited by any particular processor architecture, the presence or structure of caches or memory, or the number of
bits in any register or memory location. The present invention is not limited to any particular types of objects
that can be locked or unlocked. An object can represent any computer resource, including such resources as data,
critical code sections, hardware, or any combination of the above. Some embodiments create an object dedicated
to represent a computer resource for locking and unlocking operations. While in embodiments described above
an unused register LOCKADDR is identified by zero in the LOCKADDR register, in some embodiments an
unused register is identified by some non-zero value in the LOCKADDR register, by some value in the
THREAD_ID register, by a separate bit, or some combination thereof. A similar statement is true for unused
mLOCKADDR locations. In some embodiments, some operations described above as performed by hardware
are performed by software instead of hardware. The present invention is not limited to addresses being byte
addresses. Other embodiments and variations are within the scope of the present invention, as defined by the
appended claims.

TABLE A
Lock Instruction

1-0.    if (LOCKADDR0 == address of object to be locked)
        {
1-0a.       LOCKCOUNT0++;
1-0b.       if (LOCKCOUNT0 == 0)    /* LOCKCOUNT0 overflowed */
                LockCountOverflowIncrementTrap;
        }

1-1.    if (LOCKADDR1 == address of object to be locked)
        {
1-1a.       LOCKCOUNT1++;
1-1b.       if (LOCKCOUNT1 == 0)    /* LOCKCOUNT1 overflowed */
                LockCountOverflowIncrementTrap;
        }

1-2.    if (LOCKADDR2 == address of object to be locked)
        {
1-2a.       LOCKCOUNT2++;
1-2b.       if (LOCKCOUNT2 == 0)    /* LOCKCOUNT2 overflowed */
                LockCountOverflowIncrementTrap;
        }

1-3.    if (LOCKADDR3 == address of object to be locked)
        {
1-3a.       LOCKCOUNT3++;
1-3b.       if (LOCKCOUNT3 == 0)    /* LOCKCOUNT3 overflowed */
                LockCountOverflowIncrementTrap;
        }

2.      if (none of LOCKADDR0, LOCKADDR1, LOCKADDR2, LOCKADDR3 is equal to address of object
            to be locked)
        {
2a.         Test the LOCK bit in the object header, and set the LOCK bit to 1. (This test-and-set
            operation is an atomic operation.)
            if (the LOCK bit was set before the test-and-set operation)
                LockBusyTrap;
2b-0.       else if (LOCKADDR0 == 0)    /* LOCKADDR0 unused */
            {
2b-0a.          LOCKADDR0 = address of object to be locked;
2b-0b.          LOCKCOUNT0 = 1;
            }
2b-1.       else if (LOCKADDR1 == 0)    /* LOCKADDR1 unused */
            {
2b-1a.          LOCKADDR1 = address of object to be locked;
2b-1b.          LOCKCOUNT1 = 1;
            }
2b-2.       else if (LOCKADDR2 == 0)    /* LOCKADDR2 unused */
            {
2b-2a.          LOCKADDR2 = address of object to be locked;
2b-2b.          LOCKCOUNT2 = 1;
            }
2b-3.       else if (LOCKADDR3 == 0)    /* LOCKADDR3 unused */
            {
2b-3a.          LOCKADDR3 = address of object to be locked;
2b-3b.          LOCKCOUNT3 = 1;
            }
2c.         else NoLockAddrRegsTrap;
        }
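
The lock instruction of Table A can be modeled in C as sketched below, with each trap represented by a return code. Several points are assumptions rather than part of the table: the 8-bit LOCKCOUNT width, the `object_header` layout, the names `lock_regs`, `lock_result`, and `lock_instruction`, and the premise that an object address appears in at most one LOCKADDR register. The model is single-threaded, so the test-and-set of step 2a is written as two plain statements; in the described hardware it is one atomic operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LOCK_REGS 4

typedef enum {
    LOCK_OK,
    LOCK_COUNT_OVERFLOW_INCREMENT_TRAP,
    LOCK_BUSY_TRAP,
    NO_LOCK_ADDR_REGS_TRAP
} lock_result;

typedef struct {
    uintptr_t lockaddr[NUM_LOCK_REGS];   /* 0 = unused register */
    uint8_t   lockcount[NUM_LOCK_REGS];  /* 8-bit width is an assumption */
} lock_regs;

typedef struct { int lock_bit; } object_header;  /* hypothetical layout */

static lock_result lock_instruction(lock_regs *r, object_header *hdr,
                                    uintptr_t obj) {
    assert(r != NULL && hdr != NULL && obj != 0);

    /* Steps 1-0 through 1-3: the object is already held, so bump its count. */
    for (size_t i = 0; i < NUM_LOCK_REGS; i++) {
        if (r->lockaddr[i] == obj) {
            r->lockcount[i]++;                 /* wraps to 0 on overflow */
            return r->lockcount[i] == 0
                 ? LOCK_COUNT_OVERFLOW_INCREMENT_TRAP : LOCK_OK;
        }
    }

    /* Step 2a: test-and-set of the LOCK bit (atomic in hardware). */
    int was_set = hdr->lock_bit;
    hdr->lock_bit = 1;
    if (was_set)
        return LOCK_BUSY_TRAP;

    /* Steps 2b-0 through 2b-3: claim the first unused register. */
    for (size_t i = 0; i < NUM_LOCK_REGS; i++) {
        if (r->lockaddr[i] == 0) {
            r->lockaddr[i] = obj;
            r->lockcount[i] = 1;
            return LOCK_OK;
        }
    }

    /* Step 2c: every register is in use. */
    return NO_LOCK_ADDR_REGS_TRAP;
}
```

The model makes the cost of this design visible: a first acquisition needs the atomic test-and-set, and three distinct traps can be raised.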

TABLE B
Unlock Instruction

1-0.    if (LOCKADDR0 == address of object to be unlocked)
        {
1-0a.       LOCKCOUNT0--;
1-0b.       if (LOCKCOUNT0 == 0)
                LockCountZeroDecrementTrap;
        }

1-1.    if (LOCKADDR1 == address of object to be unlocked)
        {
1-1a.       LOCKCOUNT1--;
1-1b.       if (LOCKCOUNT1 == 0)
                LockCountZeroDecrementTrap;
        }

1-2.    if (LOCKADDR2 == address of object to be unlocked)
        {
1-2a.       LOCKCOUNT2--;
1-2b.       if (LOCKCOUNT2 == 0)
                LockCountZeroDecrementTrap;
        }

1-3.    if (LOCKADDR3 == address of object to be unlocked)
        {
1-3a.       LOCKCOUNT3--;
1-3b.       if (LOCKCOUNT3 == 0)
                LockCountZeroDecrementTrap;
        }

2.      if (none of LOCKADDR0, LOCKADDR1, LOCKADDR2, LOCKADDR3 is equal to address of object
            to be unlocked)
            LockReleaseTrap
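
The unlock instruction of Table B can be modeled the same way, again with traps represented by return codes. The names `lock_regs`, `unlock_result`, and `unlock_instruction` and the 8-bit LOCKCOUNT width are assumptions, and the model assumes that an object address appears in at most one LOCKADDR register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LOCK_REGS 4

typedef enum {
    UNLOCK_OK,
    LOCK_COUNT_ZERO_DECREMENT_TRAP,
    LOCK_RELEASE_TRAP
} unlock_result;

typedef struct {
    uintptr_t lockaddr[NUM_LOCK_REGS];   /* 0 = unused register */
    uint8_t   lockcount[NUM_LOCK_REGS];  /* 8-bit width is an assumption */
} lock_regs;

static unlock_result unlock_instruction(lock_regs *r, uintptr_t obj) {
    assert(r != NULL && obj != 0);

    /* Steps 1-0 through 1-3: decrement the count of the matching register. */
    for (size_t i = 0; i < NUM_LOCK_REGS; i++) {
        if (r->lockaddr[i] == obj) {
            r->lockcount[i]--;
            /* A count that reaches zero raises the trap, so that the trap
             * handler can release the lock. */
            return r->lockcount[i] == 0
                 ? LOCK_COUNT_ZERO_DECREMENT_TRAP : UNLOCK_OK;
        }
    }

    /* Step 2: the object is not held in any LOCKADDR register. */
    return LOCK_RELEASE_TRAP;
}
```

In this model the register itself is not cleared when the count reaches zero; that cleanup is left to whatever handles the trap.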

WE CLAIM

1. An apparatus, comprising:
    a lock cache, the lock cache holding values that identify locks for computer resources; and
    logic circuitry, the logic circuitry connected to the lock cache, and the logic circuitry performing an
    operation that determines whether a computer resource is held in the lock cache by receiving a
    value V1 identifying the computer resource and then determining whether the lock cache holds
    the value V1.

2. The apparatus of Claim 1 wherein the logic circuitry comprises:
    a comparator for comparing the value V1 with a value V2 held in the lock cache.

3. The apparatus of Claim 1 wherein the logic circuitry returns a first predetermined value if the
lock cache is not holding the value V1.

4. The apparatus of Claim 1 wherein the logic circuitry returns a second predetermined value if
the lock cache is holding the value V1.

5. The apparatus of Claim 1 wherein the logic circuitry comprises a computer processor, and the
lock cache comprises registers of an execution unit of the computer processor.

6. A process, comprising:
    receiving a value V1 identifying a computer resource; and
    determining if a lock cache of a computer processor holds the value V1, wherein the lock cache holding
    the value V1 indicates that the computer processor owns the computer resource.

7. The process of Claim 6 further comprising:
    returning a first predetermined value if no entry of the lock cache holds the value V1.

8. The process of Claim 7 further comprising:
    returning a second predetermined value if the lock cache holds the value V1.

9. The process of Claim 7 wherein the process comprises execution of an instruction on the
computer processor of a microprocessor.

10. An apparatus, comprising:
    a lock cache of a computer processor of a microprocessor, the lock cache holding values that identify
    computer resources owned by the computer processor; and
    means for performing a lock cache check operation that determines whether a value V1 corresponding
    to a computer resource is held in the lock cache.

11. The apparatus of Claim 10 wherein the means for performing a lock cache check operation
further comprises:
    means for comparing the value V1 with each value held in the lock cache.

12. The apparatus of Claim 11 wherein the means for performing a lock cache check operation
further comprises:
    means for returning a first predetermined value if the lock cache is not holding the value V1.

13. The apparatus of Claim 12 wherein the means for performing a lock cache check operation
further comprises:
    means for returning a second predetermined value if the lock cache is holding the value V1.

14. The apparatus of Claim 13 wherein the means for performing a lock cache check operation
comprises the microprocessor executing a lock cache check instruction.

15. The apparatus of Claim 14 wherein the lock cache comprises four entries.

16. The apparatus of Claim 14 wherein the lock cache comprises registers of an execution unit of
the processor.

17. The apparatus of Claim 10 wherein the lock cache holds the value V1 that corresponds to the
computer resource.

18. The apparatus of Claim 17 wherein the computer resource comprises an object.

19. The apparatus of Claim 18 wherein the object comprises a data structure.

20. The apparatus of Claim 10 wherein the lock cache comprises object addresses, the object
addresses being addresses of objects owned by a thread executing on the processor.